Abstract. In some reinforcement learning problems an agent may be
provided with a set of input policies, perhaps learned from prior
experience or provided by advisors. We present a reinforcement learning
with policy advice (RLPA) algorithm which leverages this input set and
learns to use the best policy in the set for the reinforcement learning
task at hand. We prove that RLPA has a sublinear regret of
$\widetilde{O}(\sqrt{T})$ relative to the best input policy, and that both
this regret and its computational complexity are independent of the size
of the state and action space. Our empirical simulations support our
theoretical analysis. This suggests RLPA may offer significant advantages
in large domains where some prior good policies are provided.



1 Introduction 

In reinforcement learning an agent seeks to learn a high-reward policy for select- 
ing actions in a stochastic world without prior knowledge of the world dynamics 
model and/or reward function. In this paper we consider when the agent is pro- 
vided with an input set of potential policies, and the agent's objective is to 
perform as close as possible to the (unknown) best policy in the set. This sce- 
nario could arise when the general domain involves a finite set of types of RL 
tasks (such as different user models), each with known best policies, and the 
agent is now in one of the task types but doesn't know which one. Note that 
this situation could occur both in discrete state and action spaces, and in con- 
tinuous state and/or action spaces: a robot may be traversing one of a finite set 
of different terrain types, but its sensors don't allow it to identify the terrain 
type prior to acting. Another example is when the agent is provided with a set 
of domain expert defined policies, such as stock market trading strategies. Since 
the agent has no prior information about which policy might perform best in its 
current environment, this remains a challenging RL problem. 

Prior research has considered the related case when an agent is provided with 
a fixed set of input (transition and reward) models, and the current domain is an 
(initially unknown) member of this set [5, 4, 2]. This actually provides the agent
with more information than the scenario we consider (given a model we can 
extract a policy, but the reverse is not generally true) , but more significantly, we 
find substantial theoretical and computational advantages from taking a model- 
free approach. Our work is also closely related to the idea of policy reuse [B], 
where an agent tries to leverage prior policies it found for past tasks to improve
performance on a new task; however, despite encouraging empirical performance, 
this work does not provide any formal guarantees. Most similar to our work is 
Talvitie and Singh's AtEase algorithm, which also learns to select among an
input set of policies; however, in addition to algorithmic differences, we provide 
a much more rigorous theoretical analysis that holds for a more general setting. 

We contribute a reinforcement learning with policy advice (RLPA) algorithm. 
RLPA is a model-free algorithm that, given an input set of policies, takes an 
optimism-under-uncertainty approach of adaptively selecting the policy that may 
have the highest reward for the current task. We prove the regret of our algorithm 
relative to the (unknown) best in the set policy scales with the square root of the 
time horizon, linearly with the size of the provided policy set, and is independent 
of the size of the state and action space. The computational complexity of our 
algorithm is also independent of the number of states and actions. This suggests 
our approach may have significant benefits in large domains over alternative 
approaches that typically scale with the size of the state and action space, and 
our preliminary simulation experiments provide empirical support of this impact. 

2 Preliminaries 

A Markov decision process (MDP) $M$ is defined as a tuple $\langle S, A, P, r \rangle$
where $S$ is the set of states, $A$ is the set of actions,
$P : S \times A \to \mathcal{P}(S)$ is the transition kernel mapping each
state-action pair to a distribution over states, and
$r : S \times A \to \mathcal{P}([0,1])$ is the stochastic reward function mapping
state-action pairs to a distribution over rewards bounded in the $[0, 1]$
interval.¹ A policy $\pi$ is a mapping from states to actions. Two states $s_i$
and $s_j$ communicate with each other under policy $\pi$ if the probability of
transitioning between $s_i$ and $s_j$ under $\pi$ is greater than zero. A state
$s$ is recurrent under policy $\pi$ if the probability of reentering state $s$
under $\pi$ is 1. A recurrent class is a set of recurrent states that all
communicate with each other and no other states.

We define the performance of $\pi$ in a state $s$ as its expected average reward

$$\mu^\pi(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} r(s_t, \pi(s_t)) \,\Big|\, s_0 = s \Big], \tag{1}$$

where $T$ is the number of time steps and the expectation is taken over the
stochastic transitions and rewards. If $\pi$ induces on $M$ a single recurrent
Markov class with (possibly) some transient states, then $\mu^\pi(s)$ is constant
over all the states $s \in S$, and we can define the bias function $\lambda^\pi$
such that

$$\lambda^\pi(s) + \mu^\pi = \mathbb{E}\big[ r(s, \pi(s)) + \lambda^\pi(s') \big]. \tag{2}$$

Its corresponding span is $sp(\lambda^\pi) = \max_s \lambda^\pi(s) - \min_s \lambda^\pi(s)$.

¹ The extension to larger bounded regions $[0, d]$ is trivial and just introduces
an additional $d$ multiplier to the resulting regret bounds.
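To make the average reward and the bias function concrete, the sketch below evaluates a fixed policy on a small Markov chain by relative value iteration. It is only an illustration under stated assumptions: the two-state chain, the function names `gain_and_bias` and `span`, and the iteration count are hypothetical and do not come from the paper, and the method assumes the induced chain is aperiodic with a single recurrent class.

```python
def gain_and_bias(P, r, iters=2000):
    """Relative value iteration for a fixed policy.

    P[s][j] is the transition probability s -> j under the policy and
    r[s] is the expected reward in state s. Returns (gain, bias), where
    bias is normalised so that bias[0] == 0 and satisfies
    bias[s] + gain = r[s] + sum_j P[s][j] * bias[j], as in Eq. (2).
    """
    n = len(r)
    h = [0.0] * n
    gain = 0.0
    for _ in range(iters):
        q = [r[s] + sum(P[s][j] * h[j] for j in range(n)) for s in range(n)]
        gain = q[0]                      # value at the reference state
        h = [x - gain for x in q]        # keep the bias anchored at state 0
    return gain, h


def span(h):
    """Span of a bias vector: max minus min."""
    return max(h) - min(h)


# Hypothetical two-state chain: state 0 pays reward 1, state 1 pays 0.
P = [[0.9, 0.1], [0.5, 0.5]]
r = [1.0, 0.0]
gain, bias = gain_and_bias(P, r)
```

For this chain the stationary distribution is (5/6, 1/6), so the gain is 5/6, and the bias difference between the two states (the span) is 5/3.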
In reinforcement learning [13] an agent does not know the transition $P$ and/or
reward $r$ model in advance. Its goal is typically to find a policy $\pi$ that
maximizes its obtained reward. In this paper, we consider reinforcement learning
in a new MDP $M$ where the learning algorithm is provided with an input set of
$m$ deterministic policies $\Pi = \{\pi_1, \ldots, \pi_m\}$. Such an input set of
policies could arise in multiple situations, including: the policies may
represent near-optimal policies for a set of $m$ MDPs $\{M_1, \ldots, M_m\}$ which
may be related to the current MDP $M$; the policies may be the result of
different approximation schemes (e.g., approximate policy iteration with
different approximation spaces); or they may be provided by $m$ advisors. Our
objective is to perform almost as well as the best policy in the input set $\Pi$
on the new MDP $M$ (with unknown $P$ and/or $r$).

A popular measure of the performance of a reinforcement learning algorithm
over $T$ steps is its regret relative to executing the optimal policy $\pi^*$ in
$M$. We evaluate the regret relative to the best policy $\pi^+$ in the input set
$\Pi$,

$$\Delta(s) = T\mu^+ - \sum_{t=1}^{T} r_t, \tag{3}$$

where $r_t \sim r(\cdot \mid s_t, a_t)$ and $s_0 = s$. We notice that this
definition of regret differs from the standard definition of regret by an
(approximation) error $T(\mu^* - \mu^+)$ due to the possible sub-optimality of
the policies in $\Pi$ relative to the optimal
policy for MDP M . Further discussion on this definition is provided in Section [51 
Our results require the following mild assumption: 

Assumption 1 There exists a policy $\pi^+ \in \Pi$ inducing a single recurrent
Markov class with (possibly) some transient states such that, when executed in
the new MDP $M$, $\mu^{\pi^+} = \mu^+ \ge \mu^\pi(s)$ for any state $s \in S$ and
any policy $\pi \in \Pi$. The bias $\lambda^+$ of $\pi^+$ has a span
$sp(\lambda^+) \le H^+$.

Note this is a weaker assumption than assuming MDP $M$ is ergodic, which
would require the induced Markov chain to be both aperiodic and recurrent
(see [12] for the definition of aperiodic chains): we only require that
the best policy in the input set induce a single recurrent class.

3 Algorithm 

In this section we introduce the Reinforcement Learning with Policy Advice
(RLPA) algorithm (Alg. 1). Intuitively, the algorithm seeks to identify and use
the policy in the input set $\Pi$ that yields the highest average reward on the
current MDP $M$. As the average reward of each $\pi \in \Pi$ on $M$, $\mu^\pi$, is
initially unknown, the algorithm proceeds by estimating these quantities by
executing the different $\pi$ on the current MDP. More concretely, RLPA executes
a series of trials, and within each trial is a series of episodes. Within each
trial the algorithm selects the policies in $\Pi$ with the objective of
effectively balancing between the exploration of all the policies in $\Pi$ and
the exploitation of the most promising ones. Our procedure for doing this falls
within the popular class of "optimism in the face of uncertainty" methods. To do
this, at the start of each episode, we define
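Algorithm 1 Reinforcement Learning with Policy Advice (RLPA)

Require: Set of policies $\Pi$, confidence $\delta$, span function $f$

Initialize $t = 0$, $i = 0$
Initialize $n(\pi) = 1$, $\hat\mu(\pi) = 0$, $R(\pi) = 0$ and $K(\pi) = 1$ for all $\pi \in \Pi$
while $t < T$ do
    Initialize $t_i = 0$, $T_i = 2^i$, $\Pi_i = \Pi$, $\hat H = f(T_i)$
    $i = i + 1$
    while $t_i < T_i$ & $\Pi_i \ne \emptyset$ do (run trial)
        $c(\pi) = (\hat H + 1)\sqrt{48 \log(2t/\delta) / n(\pi)} + \hat H K(\pi) / n(\pi)$
        $B(\pi) = \hat\mu(\pi) + c(\pi)$
        $\tilde\pi = \arg\max_\pi B(\pi)$
        $v(\tilde\pi) = 1$
        while $t_i < T_i$ & $v(\tilde\pi) < n(\tilde\pi)$ &
              $\hat\mu(\tilde\pi) - \frac{R(\tilde\pi)}{n(\tilde\pi) + v(\tilde\pi)} \le c(\tilde\pi) + (\hat H + 1)\sqrt{\frac{48 \log(2t/\delta)}{n(\tilde\pi) + v(\tilde\pi)}} + \hat H \frac{K(\tilde\pi)}{n(\tilde\pi) + v(\tilde\pi)}$ do (run episode)
            $t = t + 1$, $t_i = t_i + 1$
            Take action $\tilde\pi(s_t)$, observe $s_{t+1}$ and $r_{t+1}$
            $v(\tilde\pi) = v(\tilde\pi) + 1$, $R(\tilde\pi) = R(\tilde\pi) + r_{t+1}$
        end while
        $K(\tilde\pi) = K(\tilde\pi) + 1$
        if $\hat\mu(\tilde\pi) - \frac{R(\tilde\pi)}{n(\tilde\pi) + v(\tilde\pi)} > c(\tilde\pi) + (\hat H + 1)\sqrt{\frac{48 \log(2t/\delta)}{n(\tilde\pi) + v(\tilde\pi)}} + \hat H \frac{K(\tilde\pi)}{n(\tilde\pi) + v(\tilde\pi)}$ then
            $\Pi_i = \Pi_i - \{\tilde\pi\}$
        end if
        $n(\tilde\pi) = n(\tilde\pi) + v(\tilde\pi)$, $\hat\mu(\tilde\pi) = R(\tilde\pi) / n(\tilde\pi)$
    end while
end while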



an upper bound on the possible average reward of each policy (Line 8): this
average reward is computed as a combination of the average reward observed so
far for this policy $\hat\mu(\pi)$, the number of time steps this policy has been
executed $n(\pi)$ and $\hat H$, which represents a guess of the span of the best
policy, $H^+$. We then select the policy with the maximum upper bound
$\tilde\pi$ (Line 9) to run for this episode. Unlike in multi-armed bandit
settings where a selected arm is pulled for only one step, here the MDP policy
is run for up to $n(\tilde\pi)$ steps, i.e., until its total number of execution
steps is at most doubled. If $\hat H \ge H^+$ then the confidence bounds computed
(Line 8) are valid confidence intervals for the true best policy $\pi^+$;
however, they may fail to hold for any other policy $\pi$ whose span
$sp(\lambda^\pi) > \hat H$. Therefore, we can cut off execution of an episode
when these confidence bounds fail to hold (the condition specified on Line 12),
since the policy may not be an optimal one for the current MDP, if
$\hat H \ge H^+$. In this case, we can eliminate the current policy $\tilde\pi$
from the set of policies considered in this trial (see Line 21). After an
episode terminates, the parameters of the current policy $\tilde\pi$ (the number
of steps $n(\tilde\pi)$ and average reward $\hat\mu(\tilde\pi)$) are updated, new
upper bounds on the policies are computed, and the next episode proceeds. As
the average reward estimates converge, the better policies will be chosen more.
Note that since we do not know $H^+$ in advance, we must estimate it online:
otherwise, if $\hat H$ is not a valid upper bound for the span $H^+$ (see
Assumption 1), a trial might eliminate the best policy $\pi^+$, thus incurring a
significant regret. We address this by successively doubling the amount of time
$T_i$ each trial is run, and defining a $\hat H$ that is a function $f$ of the
current trial length. See Section 4.1 for a more detailed discussion on the
choice of $f$. This procedure guarantees the algorithm will eventually find an
upper bound on the span $H^+$ and perform trials with very small regret in high
probability. Finally, RLPA is an anytime algorithm since it does not need to
know the time horizon $T$ in advance.
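The trial and episode structure described in this section can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `rlpa`, the `step(state, action)` environment interface, the default `delta`, and the guard `max(t, 1)` inside the logarithm (needed because the listing starts at t = 0) are all assumptions. The sketch follows the pseudocode literally, including the counter v starting at 1, and additionally stops at the horizon T.

```python
import math


def rlpa(policies, step, s0, T, delta=0.05, f=math.log):
    """Sketch of RLPA. `policies` is a list of callables state -> action;
    `step(state, action)` returns (next_state, reward) with reward in [0, 1].
    Returns the total collected reward and the per-policy counters n."""
    m = len(policies)
    n = [1] * m        # step counters n(pi), initialised to 1 as in the listing
    mu = [0.0] * m     # empirical average rewards
    R = [0.0] * m      # cumulative rewards
    K = [1] * m        # episode counters
    t, i, s, total = 0, 0, s0, 0.0

    def width(p, count, H):
        # (H + 1) * sqrt(48 log(2t/delta) / count) + H * K(p) / count
        log_term = math.log(2 * max(t, 1) / delta)  # max(t, 1) is an assumption
        return (H + 1) * math.sqrt(48 * log_term / count) + H * K[p] / count

    while t < T:
        t_i, T_i = 0, 2 ** i
        H = f(T_i)                 # guess of the span for this trial
        active = set(range(m))     # every policy is re-admitted in each trial
        i += 1
        while t < T and t_i < T_i and active:
            c = {q: width(q, n[q], H) for q in active}
            p = max(active, key=lambda q: mu[q] + c[q])   # optimistic policy
            v = 1

            def consistent():
                # consistency check between old estimate and running average
                return mu[p] - R[p] / (n[p] + v) <= c[p] + width(p, n[p] + v, H)

            while t < T and t_i < T_i and v < n[p] and consistent():
                s, reward = step(s, policies[p](s))
                t, t_i, v = t + 1, t_i + 1, v + 1
                R[p] += reward
                total += reward
            K[p] += 1
            if not consistent():
                active.discard(p)  # policy eliminated for the rest of the trial
            n[p] += v
            mu[p] = R[p] / n[p]
    return total, n
```

Because v starts at 1, each episode adds one more to n(pi) than the number of steps actually executed, exactly as in the listing; a practical implementation might prefer to count executed steps only.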



4 Regret Analysis 

In this section we derive a regret analysis of RLPA and we compare its
performance to existing RL regret minimization algorithms. We first derive
preliminary results used in the proofs of the two main theorems.

We begin by proving a general high-probability bound on the difference
between the average reward $\mu^\pi$ and its empirical estimate $\hat\mu(\pi)$ of
a policy $\pi$ (throughout this discussion we mean the average reward of a
policy $\pi$ on a new MDP $M$). Let $K(\pi)$ be the number of episodes $\pi$ has
been run, each of them of length $v_k(\pi)$ ($k = 1, \ldots, K(\pi)$). The
empirical average $\hat\mu(\pi)$ is defined as

$$\hat\mu(\pi) = \frac{1}{n(\pi)} \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} r_t^k, \tag{4}$$

where $r_t^k \sim r(\cdot \mid s_t^k, \pi(s_t^k))$ is a random sample of the reward
observed by taking the action suggested by $\pi$ and
$n(\pi) = \sum_k v_k(\pi)$ is the total count of samples. Notice that in each
episode $k$, the first state $s_1^k$ does not necessarily correspond to the next
state of the last step $v_{k-1}(\pi)$ of the previous episode.

Lemma 1. Assume that a policy $\pi$ induces on the MDP $M$ a single recurrent
class with some additional transient states, i.e., $\mu^\pi(s) = \mu^\pi$ for all
$s \in S$. Then the difference between the average reward and its empirical
estimate (Eq. 4) is

$$|\hat\mu(\pi) - \mu^\pi| \le 2(H^\pi + 1)\sqrt{\frac{2\log(2/\delta)}{n(\pi)}} + H^\pi \frac{K(\pi)}{n(\pi)},$$

with probability $\ge 1 - \delta$, where $H^\pi = sp(\lambda^\pi)$ (see Eq. 2).

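Proof. Let $r^\pi(s_t^k) = \mathbb{E}(r_t^k \mid s_t^k, \pi(s_t^k))$,
$\epsilon_r(t,k) = r_t^k - r^\pi(s_t^k)$, and $P^\pi$ be the state-transition
kernel under policy $\pi$ (i.e., for finite state and action spaces, $P^\pi$ is
the $|S| \times |S|$ matrix where the $ij$-th entry is $p(s_j \mid s_i, \pi(s_i))$).
Then we have

$$\hat\mu(\pi) - \mu^\pi = \frac{1}{n(\pi)} \Big( \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} \big( \epsilon_r(t,k) + r^\pi(s_t^k) - \mu^\pi \big) \Big)$$

$$= \frac{1}{n(\pi)} \Big( \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} \big( \epsilon_r(t,k) + \lambda^\pi(s_t^k) - P^\pi \lambda^\pi(s_t^k) \big) \Big),$$

where the second line follows from Equation 2. Let
$\epsilon_\lambda(t,k) = \lambda^\pi(s_{t+1}^k) - P^\pi \lambda^\pi(s_t^k)$.
Then we have

$$\hat\mu(\pi) - \mu^\pi = \frac{1}{n(\pi)} \Big( \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} \big( \epsilon_r(t,k) + \lambda^\pi(s_t^k) - \lambda^\pi(s_{t+1}^k) + \epsilon_\lambda(t,k) \big) \Big)$$

$$\le \frac{1}{n(\pi)} \Big( \sum_{k=1}^{K(\pi)} \Big( H^\pi + \sum_{t=1}^{v_k(\pi)} \epsilon_r(t,k) + \sum_{t=1}^{v_k(\pi)-1} \epsilon_\lambda(t,k) \Big) \Big),$$

where we bounded the telescoping sequence
$\sum_t (\lambda^\pi(s_t^k) - \lambda^\pi(s_{t+1}^k)) \le sp(\lambda^\pi) = H^\pi$. The
sequences of random variables $\{\epsilon_r\}$ and $\{\epsilon_\lambda\}$, as well
as their sums, are martingale difference sequences. Therefore we can apply
Azuma's inequality and obtain the bound

$$\hat\mu(\pi) - \mu^\pi \le \frac{K(\pi) H^\pi + 2\sqrt{2 n(\pi) \log(1/\delta)} + 2 H^\pi \sqrt{2 (n(\pi) - K(\pi)) \log(1/\delta)}}{n(\pi)}$$

$$\le H^\pi \frac{K(\pi)}{n(\pi)} + 2 (H^\pi + 1) \sqrt{\frac{2 \log(1/\delta)}{n(\pi)}},$$

with probability $\ge 1 - \delta$, where in the first inequality we bounded the
error terms $\epsilon_r$, each of which is bounded in $[-1, 1]$, and
$\epsilon_\lambda$, bounded in $[-H^\pi, H^\pi]$. The other side of the inequality
follows exactly the same steps. □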
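The confidence width of Lemma 1 is easy to evaluate numerically, which helps to see how the statistical term and the episode-count term trade off. The sketch below is illustrative only: the function name `lemma1_width` is hypothetical, and it uses the two-sided form with $\log(2/\delta)$, which is an assumption about how the two one-sided bounds of the proof are combined.

```python
import math


def lemma1_width(H, n, K, delta):
    """Width 2(H + 1) sqrt(2 log(2/delta) / n) + H K / n for a policy with
    span H, n samples and K episodes, at confidence level delta."""
    statistical = 2.0 * (H + 1.0) * math.sqrt(2.0 * math.log(2.0 / delta) / n)
    episode_switching = H * K / n   # cost of restarting the policy K times
    return statistical + episode_switching


# Quadrupling the number of samples roughly halves the statistical term.
w_small = lemma1_width(H=1.0, n=100, K=5, delta=0.1)
w_large = lemma1_width(H=1.0, n=400, K=5, delta=0.1)
```

Since the second term decays as 1/n while the first decays as 1/sqrt(n), the episode-switching cost becomes negligible as long as K grows slowly with n.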

In the algorithm $H^+$ is not known and at each trial $i$ the confidence bounds
are built using the guess on the span $\hat H = f(T_i)$, where $f$ is an
increasing function. For the algorithm to perform well, it needs to not discard
the best policy $\pi^+$ (line 21). The following lemma guarantees that after a
certain number of steps, with high probability the policy $\pi^+$ is not
discarded in any trial.

Lemma 2. For any trial started after $T \ge T^+ = f^{-1}(H^+)$, the probability of
policy $\pi^+$ to be excluded from $\Pi_A$ at any time is less than $(\delta/T)^6$.

Proof. Let $i$ be the first trial such that $T_i \ge f^{-1}(H^+)$, which implies
that $\hat H = f(T_i) \ge H^+$. The corresponding step $T$ is at most the sum of
the length of all the trials before $i$, i.e.,
$T \le \sum_{j=1}^{i-1} 2^j \le 2^i$, thus leading to the condition
$T \ge T^+ = f^{-1}(H^+)$. After $T \ge T^+$ the conditions in Lemma 1 (with
Assumption 1) are satisfied for $\pi^+$. Therefore the confidence intervals hold
with probability at least $1 - \delta$ and we have for $\hat\mu(\pi^+)$

$$\hat\mu(\pi^+) - \mu^+ \le 2(H^+ + 1)\sqrt{\frac{2\log(1/\delta)}{n(\pi^+)}} + H^+ \frac{K(\pi^+)}{n(\pi^+)} \le 2(\hat H + 1)\sqrt{\frac{2\log(1/\delta)}{n(\pi^+)}} + \hat H \frac{K(\pi^+)}{n(\pi^+)},$$

where $n(\pi^+)$ is the number of steps when policy $\pi^+$ has been selected until
$T$. Using a similar argument as in the proof of Lemma 1 we can derive

$$\mu^+ - \frac{R(\pi^+)}{n(\pi^+) + v(\pi^+)} \le 2(\hat H + 1)\sqrt{\frac{2\log(1/\delta)}{n(\pi^+) + v(\pi^+)}} + \hat H \frac{K(\pi^+)}{n(\pi^+) + v(\pi^+)},$$

with probability at least $1 - \delta$. Bringing together these two conditions,
and applying the union bound, we have that the condition in line 12 holds with
at least probability $1 - 2\delta$ and thus $\pi^+$ is never discarded. More
precisely, Algorithm 1 uses slightly larger confidence intervals (notably
$\sqrt{48 \log(2t/\delta)}$ instead of $2\sqrt{2\log(1/\delta)}$), which guarantees
that $\pi^+$ is discarded with at most a probability of $(\delta/T)^6$. □
We also need the $B$-values (line 9) to be valid upper confidence bounds on the
average reward of the best policy $\mu^+$.

Lemma 3. For any trial started after $T \ge T^+ = f^{-1}(H^+)$, the $B$-value of
the optimistic policy $\tilde\pi$ is an upper bound on $\mu^+$, that is
$B(\tilde\pi) \ge \mu^+$, with probability $\ge 1 - (\delta/T)^6$.

Proof. Lemma 2 guarantees that the policy $\pi^+$ is in $\Pi_A$ w.p. at least
$1 - (\delta/T)^6$. This combined with Lemma 1 and the fact that $f(T) \ge H^+$
implies that the $B$-value $B(\pi^+) = \hat\mu(\pi^+) + c(\pi^+)$ is a
high-probability upper bound on $\mu^+$ and, since $\tilde\pi$ is the policy with
the maximum $B$-value, the result follows. □

Finally, we bound the total number of episodes a policy could be selected.

Lemma 4. After $T \ge T^+ = f^{-1}(H^+)$ steps of Algorithm 1, let $K(\pi)$ be the
total number of episodes $\pi$ has been selected and $n(\pi)$ the corresponding
total number of samples, then

$$K(\pi) \le \log_2(f^{-1}(H^+)) + \log_2(T) + \log_2(n(\pi))$$

with probability $\ge 1 - (\delta/T)^6$.

Proof. Let $n_k(\pi)$ be the total number of samples at the beginning of episode
$k$ (i.e., $n_k(\pi) = \sum_{k'=1}^{k-1} v_{k'}(\pi)$). In each trial of Algorithm 1
an episode is terminated when the number of samples is doubled (i.e.,
$n_{k+1}(\pi) = 2 n_k(\pi)$), or when the consistency condition (last condition in
line 12) is violated and the policy is discarded or the trial is terminated
(i.e., $n_{k+1}(\pi) \ge n_k(\pi)$). We denote by $\bar K(\pi)$ the total number of
episodes truncated before the number of samples is doubled, then
$n(\pi) \ge 2^{K(\pi) - \bar K(\pi)}$. Since the episode is terminated before the
number of samples is doubled only when either the trial terminates or the
policy is discarded, in each trial this can only happen once per policy. Thus
we can bound $\bar K(\pi)$ by the number of trials. A trial can either terminate
because its maximum length $T_i$ is reached or when all the policies are
discarded (line 6). From Lemma 2 we have that after $T \ge f^{-1}(H^+)$, $\pi^+$ is
never discarded w.h.p. and a trial only terminates when $t_i > T_i$. Since
$T_i = 2^i$, it follows that the number of trials is bounded by
$\bar K(\pi) \le \log_2(f^{-1}(H^+)) + \log_2(T)$. So, we have
$n(\pi) \ge 2^{K(\pi) - \log_2(f^{-1}(H^+)) - \log_2(T)}$, which implies the
statement of the lemma. □
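The counting argument behind Lemma 4 rests on the fact that every episode that runs to completion doubles the sample count, so only logarithmically many such episodes fit into n samples. A small sketch of that fact, with the hypothetical helper name `doubling_episodes`:

```python
import math


def doubling_episodes(target):
    """Number of complete (sample-doubling) episodes needed for the sample
    count, starting from 1, to reach at least `target`."""
    n, episodes = 1, 0
    while n < target:
        n *= 2            # a complete episode doubles the number of samples
        episodes += 1
    return episodes


full_episodes = doubling_episodes(1000)   # at most ceil(log2(1000))
```

The remaining episodes counted by Lemma 4 are the truncated ones, of which there is at most one per policy per trial.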

Notice that if we plug this result in the statement of Lemma 1, we have that
the second term converges to zero faster than the first term which decreases as
$O(1/\sqrt{n(\pi)})$, thus in principle it could be possible to use alternative episode
stopping criteria, such as ^(Tr) < ^ n(7r). But while this would not significantly 
affect the convergence rate of $\hat\mu(\pi)$, it may worsen the global regret
performance in Theorem 1.

4.1 Gap-Independent Bound 

We are now ready to derive the first regret bound for RLPA. 
Theorem 1. Under Assumption 1, for any $T \ge T^+ = f^{-1}(H^+)$ the regret of
Algorithm 1 is bounded as

$$\Delta(s) \le 24 (f(T) + 1) \sqrt{3 T m \log(T/\delta)} + \sqrt{T} + 6 f(T) m \big( \log_2(T^+) + 2 \log_2(T) \big)$$

with probability at least $1 - \delta$ for any initial state $s \in S$.

Proof. We begin by bounding the regret from executing each policy $\pi$. We
consider the $k(\pi)$-th episode when policy $\pi$ has been selected (i.e., $\pi$
is the optimistic policy $\tilde\pi$) and we study its corresponding total regret
$\Delta_\pi$. We denote by $n_k(\pi)$ the number of steps of policy $\pi$ at the
beginning of episode $k$ and $v_k(\pi)$ the number of steps in episode $k$. Also
at time step $T$, let the total number of episodes, $v_k(\pi)$ and $n_k(\pi)$, for
each policy $\pi$ be denoted as $K(\pi)$, $v(\pi)$ and $n(\pi)$ respectively. We
also let, for each $\pi \in \Pi$, $B(\pi)$, $c(\pi)$, $R(\pi)$ and $\hat\mu(\pi)$ be
the latest values of these variables at time step $T$. Let
$\mathcal{E} = \{\forall t = f^{-1}(H^+), \ldots, T,\ \pi^+ \in \Pi_A \ \&\ B(\tilde\pi) \ge \mu^+\}$
be the event under which $\pi^+$ is never removed from the set of policies
$\Pi_A$, and where the upper bound of the optimistic policy $\tilde\pi$,
$B(\tilde\pi)$, is always as large as the true average reward of the best policy
$\mu^+$. On the event $\mathcal{E}$, $\Delta_\pi$ can be bounded as

$$\Delta_\pi = \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} (\mu^+ - r_t) \overset{(1)}{\le} \sum_{k=1}^{K(\pi)} \sum_{t=1}^{v_k(\pi)} (B(\pi) - r_t)$$

$$= (n(\pi) + v(\pi)) \Big( \hat\mu(\pi) + c(\pi) - \frac{R(\pi)}{n(\pi) + v(\pi)} \Big) \overset{(2)}{\le} (n(\pi) + v(\pi)) \Big( 3 (\hat H + 1) \sqrt{\frac{48 \log(T/\delta)}{n(\pi)}} + 3 \hat H \frac{K(\pi)}{n(\pi)} \Big)$$

$$\overset{(3)}{\le} 24 (f(T) + 1) \sqrt{3 n(\pi) \log(T/\delta)} + 6 f(T) K(\pi),$$

where in (1) we rely on the fact that $\pi$ is only executed when it is the
optimistic policy, and $B(\pi)$ is optimistic with respect to $\mu^+$ according to
Lemma 3. (2) immediately follows from the stopping condition at line 12 and the
definition of $c(\pi)$. (3) follows from the condition on doubling the samples
(line 12) which guarantees $v(\pi) \le n(\pi)$. We now bound the total regret
$\Delta$ by summing over all the policies.

$$\Delta \le \sum_{\pi \in \Pi} 24 (f(T) + 1) \sqrt{3 n(\pi) \log(T/\delta)} + 6 f(T) \sum_{\pi \in \Pi} K(\pi)$$

$$\overset{(1)}{\le} 24 (f(T) + 1) \sqrt{3 m \sum_{\pi \in \Pi} n(\pi) \log(T/\delta)} + 6 f(T) \sum_{\pi \in \Pi} K(\pi)$$

$$\overset{(2)}{\le} 24 (f(T) + 1) \sqrt{3 m T \log(T/\delta)} + 6 f(T) m \big( \log_2(f^{-1}(H^+)) + 2 \log_2(T) \big),$$

where in (1) we use the Cauchy-Schwarz inequality and (2) follows from
$\sum_\pi n(\pi) \le T$, Lemma 4 and $\log_2(n(\pi)) \le \log_2(T)$.

Since $T$ is an unknown time horizon, we need to provide a bound which holds with
high probability uniformly over all the possible values of $T$. Thus we need to
deal with the case when $\mathcal{E}$ does not hold. Based on Lemma 1 and by
following similar lines to [7], we can prove that the total regret of the
episodes in which the true model is discarded is bounded by $\sqrt{T}$ with
probability at least $1 - \delta/(12 T^{5/4})$. Due to space limitations, we omit
the details, but we can then prove the final result by combining the regret in
both cases (when $\mathcal{E}$ holds or does not hold) and taking a union bound on
all possible values of $T$. □
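To get a feel for how the bound of Theorem 1 scales, one can simply evaluate its right-hand side. The helper below is a sketch under stated assumptions: the name `regret_bound` and the default `f = math.log` are illustrative, and the formula is the bound as written in the theorem statement, with no attempt to tighten its constants.

```python
import math


def regret_bound(T, m, delta, T_plus, f=math.log):
    """Right-hand side of the gap-independent bound for horizon T, m input
    policies, confidence delta and T_plus = f^{-1}(H^+)."""
    leading = 24.0 * (f(T) + 1.0) * math.sqrt(3.0 * T * m * math.log(T / delta))
    episodes = 6.0 * f(T) * m * (math.log2(T_plus) + 2.0 * math.log2(T))
    return leading + math.sqrt(T) + episodes


b1 = regret_bound(T=10**6, m=4, delta=0.05, T_plus=100)
b2 = regret_bound(T=4 * 10**6, m=4, delta=0.05, T_plus=100)
growth = b2 / b1   # well below 4, i.e. sublinear growth in T
```

The constants are large, so for moderate horizons the numerical value can exceed the trivial bound T; the quantity of interest is the growth rate in T and m, and the absence of any dependence on the number of states or actions.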
A significant advantage of RLPA over generic RL algorithms (such as UCRL2)
is that the regret of RLPA is independent of the size of the state and action
spaces: in contrast, the regret of UCRL2 scales as $O(S\sqrt{AT})$. This advantage
is obtained by exploiting the prior information that $\Pi$ contains good
policies, which allows the algorithm to focus on testing their performance to
identify the best, instead of building an estimate of the current MDP over the
whole state-action space as in UCRL2. It is also informative to compare this
result to other methods using some form of prior knowledge. In [5] the
objective is to learn the optimal policy along with a state representation
which satisfies the Markov property. The algorithm receives as input a set of
possible state representation models and under the assumption that one of them
is Markovian, the algorithm is shown to have a sublinear regret. Nonetheless,
the algorithm inherits the regret of UCRL itself and still displays a
$O(S\sqrt{A})$ dependency on states and actions. In [5] the Parameter Elimination
(PEL) algorithm is provided with a set of MDPs. The algorithm is analyzed in
the PAC-MDP framework and under the assumption that the true model actually
belongs to the set of MDPs, it is shown to have a performance which does not
depend on the size of the state-action space and it only has a $O(\sqrt{m})$
dependency on the number of MDPs $m$.² In our setting, although no model is
provided and no assumption on the optimality of $\pi^*$ is made, RLPA achieves
the same dependency on $m$.

The span $sp(\lambda^\pi)$ of a policy is known to be a critical parameter
determining how well and fast the average reward of a policy can be estimated
using samples (see e.g., [1]). In Theorem 1 we show that only the span $H^+$ of
the best policy $\pi^+$ affects the performance of RLPA even when other policies
have much larger spans. Although this result may seem surprising (the algorithm
estimates the average reward for all the policies), it follows from the use of
the third condition in line 12 where an episode is terminated, and a policy is
discarded, whenever the empirical estimates are not consistent with the guessed
confidence interval. Let us consider the case when $\hat H \ge H^+$ but
$\hat H < sp(\lambda^\pi)$ for a policy which is selected as the optimistic policy
$\tilde\pi$. Since the confidence intervals built for $\pi$ are not correct (see
Lemma 1), $\tilde\pi$ could be selected for a long while before selecting a
different policy. On the other hand, the condition on the consistency of the
observed rewards would discard $\pi$ (with high probability), thus increasing the
chances of the best policy (whose confidence intervals are correct) to be
selected. We also note that $H^+$ appears as a constant in the regret through
$\log_2(f^{-1}(H^+))$ and this suggests that the optimal choice of $f$ is
$f(T) = \log(T)$, which would lead to a bound of order (up to constants and
logarithmic terms) $\widetilde{O}(\sqrt{Tm} + m)$.
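The role of the span function $f$ can be illustrated by computing how long the doubling schedule needs before the guess $\hat H = f(T_i)$ first dominates $H^+$. The sketch below assumes $T_i = 2^i$ and $f = \log$ as suggested above; the helper names `first_valid_trial` and `steps_before_trial` are hypothetical.

```python
import math


def first_valid_trial(H_plus, f=math.log):
    """Smallest trial index i with f(2**i) >= H_plus, i.e. the first trial
    whose span guess is a valid upper bound on H^+."""
    i = 0
    while f(2 ** i) < H_plus:
        i += 1
    return i


def steps_before_trial(i):
    """Total length of trials 0, ..., i-1 when trial j lasts 2**j steps."""
    return 2 ** i - 1


i_star = first_valid_trial(3.0)            # log(16) < 3 <= log(32)
warmup = steps_before_trial(i_star)        # steps spent with a guess that is too small
```

With $f = \log$ the warm-up length is of order $e^{H^+}$, which enters the regret only through $\log_2(f^{-1}(H^+))$; a faster-growing $f$ would shorten the warm-up but inflate the $f(T)$ factor multiplying the leading term.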

4.2 Gap-Dependent Bound 

Similar to [7], we can derive an alternative bound for RLPA where the
dependency on $T$ becomes logarithmic and the gap between the average of the
best and second best policies appears. We first need to introduce two
assumptions.

² Notice that PAC bounds are always squared w.r.t. regret bounds, thus the
original $m$ dependency in [5] becomes $O(\sqrt{m})$ when compared to a regret
bound.



10 Azar, Lazaric, and Brunskill 



Assumption 2 (Average Reward) Each policy $\pi \in \Pi$ induces on the MDP
$M$ a single recurrent class with some additional transient states, i.e.,
$\mu^\pi(s) = \mu^\pi$ for all $s \in S$. This implies that
$H^\pi = sp(\lambda^\pi) < +\infty$.

Assumption 3 (Minimum Gap) Define the gap between the average reward
of the best policy $\pi^+$ and the average reward of any other policy as
$\Gamma(\pi, s) = \mu^+ - \mu^\pi(s)$ for all $s \in S$. We then assume that for all
$\pi \in \Pi - \{\pi^+\}$ and $s \in S$, $\Gamma(\pi, s)$ is uniformly bounded from
below by a positive constant $\Gamma_{\min} > 0$, i.e.,

$$\Gamma(\pi, s) \ge \Gamma_{\min}.$$

Theorem 2 (Gap Dependent Bounds). Let Assumptions 2 and 3 hold. Run
Algorithm 1 with the choice of $\delta = \sqrt[3]{1/T}$ (the stopping time $T$ is
assumed to be known here). Assume that for all $\pi \in \Pi$ we have that
$H^\pi \le H_{\max}$. Then the expected regret of Algorithm 1 after
$T \ge T^+ = f^{-1}(H^+)$ steps is bounded as

$$\mathbb{E}(\Delta(s)) = O\Big( m \frac{(f(T) + H_{\max}) (\log_2(mT) + \log_2(T^+))}{\Gamma_{\min}} \Big) \tag{5}$$

for any initial state $s \in S$.

Proof (sketch). Unlike for the proof of Theorem 1, here we need a more refined
control on the number of steps of each policy as a function of the gaps
$\Gamma(\pi, s)$. We first notice that Assumption 2 allows us to define
$\Gamma(\pi) = \Gamma(\pi, s) = \mu^+ - \mu^\pi$ for any state $s \in S$ and any
policy $\pi \in \Pi$. We consider the high-probability event
$\mathcal{E} = \{\forall t = f^{-1}(H^+), \ldots, T,\ \pi^+ \in \Pi_A\}$ (see Lemma 2)
under which the trials run after $f^{-1}(H^+)$ steps never discard policy $\pi^+$.
We focus on the episode at time $t$, when an optimistic policy
$\tilde\pi \ne \pi^+$ is selected for the $k(\tilde\pi)$-th time, and we denote by
$n_k(\tilde\pi)$ the number of steps of $\tilde\pi$ before episode $k$ and
$v_k(\tilde\pi)$ the number of steps during episode $k(\tilde\pi)$. The cumulative
reward during episode $k$ is $R_k(\tilde\pi)$, obtained as the sum of
$\hat\mu_k(\tilde\pi) n_k(\tilde\pi)$ (the previous cumulative reward) and the sum
of the $v_k(\tilde\pi)$ rewards received since the beginning of the episode. Let
$\mathcal{E} = \{\forall t = f^{-1}(H^+), \ldots, T,\ \pi^+ \in \Pi_A \ \&\ B(\tilde\pi) \ge \mu^+\}$
be the event under which $\pi^+$ is never removed from the set of policies
$\Pi_A$, and where the upper bound of the optimistic policy $\tilde\pi$,
$B(\tilde\pi)$, is always as large as the true average reward of the best policy
$\mu^+$. On event $\mathcal{E}$ we have



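$$3 (\hat H + 1) \sqrt{\frac{48 \log(t/\delta)}{n_k(\tilde\pi)}} + 3 \hat H \frac{K(\tilde\pi)}{n_k(\tilde\pi)} \overset{(1)}{\ge} B(\tilde\pi) - \frac{R_k(\tilde\pi)}{n_k(\tilde\pi) + v_k(\tilde\pi)}$$

$$\overset{(2)}{\ge} \mu^+ - \frac{R_k(\tilde\pi)}{n_k(\tilde\pi) + v_k(\tilde\pi)} \overset{(3)}{\ge} \Gamma_{\min} + \mu^{\tilde\pi} - \frac{R_k(\tilde\pi)}{n_k(\tilde\pi) + v_k(\tilde\pi)}$$

$$\overset{(4)}{\ge} \Gamma_{\min} - H^{\tilde\pi} \sqrt{\frac{48 \log(t/\delta)}{n_k(\tilde\pi)}} - H^{\tilde\pi} \frac{K(\tilde\pi)}{n_k(\tilde\pi)}$$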

with probability $1 - (\delta/t)^6$. Inequality (1) is enforced by the episode
stopping condition in line 12 and the definition of $B(\tilde\pi)$, (2) is
guaranteed by Lemma 3, (3) relies on the definition of gap and Assumption 3,
while (4) is a direct application of Lemma 1. Rearranging the terms, and
applying Lemma 4, we obtain



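$$n_k(\tilde\pi)\, \Gamma_{\min} \le (3 \hat H + 3 + H^{\tilde\pi}) \sqrt{n_k(\tilde\pi)} \sqrt{48 \log(t/\delta)} + 4 H^{\tilde\pi} \big( 2 \log_2(t) + \log_2(f^{-1}(H^+)) \big).$$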
By solving the inequality w.r.t. $n_k(\tilde\pi)$ we obtain

$$\sqrt{n_k(\tilde\pi)} \le \frac{(3 \hat H + 3 + H^{\tilde\pi}) \sqrt{48 \log(t/\delta)} + 2 \sqrt{H^{\tilde\pi} \Gamma_{\min} \big( 2 \log_2(t) + \log_2(f^{-1}(H^+)) \big)}}{\Gamma_{\min}}, \tag{6}$$

w.p. $1 - (\delta/t)^6$. This implies that on the event $\mathcal{E}$, after $t$
steps, RLPA acted according to a suboptimal policy $\pi$ for no more than
$O(\log(t)/\Gamma_{\min}^2)$ steps. The rest of the proof follows similar steps as
in Theorem 1 to bound the regret of all the suboptimal policies in high
probability. The expected regret of $\pi^+$ is bounded by $H^+$ and standard
arguments similar to [7] are used to move from high-probability to expectation
bounds. □

Note that although the bound in Theorem 1 is stated in high-probability, it is
easy to turn it into a bound in expectation with almost identical dependencies
on the main characteristics of the problem and compare it to the bound of
Theorem 2. The major difference is that the bound in Eq. 5 shows a
$O(\log(T)/\Gamma_{\min})$ dependency on $T$ instead of $O(\sqrt{T})$. This suggests
that whenever there is a big margin between the best policy and the other
policies in $\Pi$, the algorithm is able to accordingly reduce the number of
times suboptimal policies are selected, thus achieving a better dependency on
$T$. On the other hand, the bound also shows that whenever the policies in $\Pi$
are very similar, it might take the algorithm a long time to find the best
policy, although the regret cannot be larger than $O(\sqrt{T})$ as shown in
Theorem 1.

We also note that while Assumption 3 is needed to allow the algorithm to
"discard" suboptimal policies with only a logarithmic number of steps,
Assumption 2 is more technical and can be relaxed. It is possible to instead
only require that each policy $\pi \in \Pi$ has a bounded span, $H^\pi < \infty$,
which is a milder condition than requiring a constant average reward over
states (i.e., $\mu^\pi(s) = \mu^\pi$).



5 Computational Complexity 

As shown in Algorithm 1, RLPA runs over multiple trials and episodes where
policies are selected and run. The largest computational cost in RLPA is at
the start of each episode, computing the $B$-values for all the policies
currently active in $\Pi_A$ and then selecting the most optimistic one. This is
an $O(m)$ operation. The total number of episodes can be upper bounded by
$2\log_2(T) + \log_2(f^{-1}(H^+))$ (see Lemma 4). This means the overall
computational cost of RLPA is of $O(m(\log_2(T) + \log_2(f^{-1}(H^+))))$. Note
there is no explicit dependence on the size of the state and action space. In
contrast, UCRL2 has a similar number of trials, but requires solving extended
value iteration to compute the optimistic MDP policy. Extended value iteration
requires $O(|S|^2 |A| \log(|S|))$ computation per iteration: if $D$ is the number
of iterations required to complete extended value iteration, then the
resulting cost would be $O(D |S|^2 |A| \log(|S|))$. Therefore UCRL2, like many
generic RL approaches, will suffer a computational complexity that scales
quadratically with the number of states, in contrast to RLPA, which depends
linearly on the number of input policies and is independent of the size of the
state and action space.
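The two cost expressions above can be compared directly. The sketch below only evaluates the stated orders of growth with all hidden constants set to one, which is an assumption, as are the function names `rlpa_selection_cost` and `evi_cost`.

```python
import math


def rlpa_selection_cost(m, T, T_plus):
    """Order of the policy-selection cost of RLPA:
    m * (2 log2(T) + log2(T_plus)), with T_plus = f^{-1}(H^+)."""
    return m * (2.0 * math.log2(T) + math.log2(T_plus))


def evi_cost(num_states, num_actions, iterations):
    """Order of the cost of extended value iteration:
    D * |S|^2 * |A| * log(|S|)."""
    return iterations * num_states ** 2 * num_actions * math.log(num_states)


rlpa = rlpa_selection_cost(m=4, T=10**5, T_plus=100)
small_grid = evi_cost(num_states=100, num_actions=4, iterations=10)
large_grid = evi_cost(num_states=400, num_actions=4, iterations=10)
```

Quadrupling the number of states multiplies the extended value iteration cost by more than sixteen, while the RLPA selection cost does not change because it has no state or action argument at all.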
6 Experiments

In this section we provide some preliminary empirical evidence of the benefit of
our proposed approach. We compare our approach with two other baselines. As
mentioned previously, UCRL2 [7] is a well known algorithm for generic RL
problems that enjoys strong theoretical guarantees in terms of high probability
regret bounds with the optimal rate of $O(\sqrt{T})$. Unlike our approach, UCRL2
does not make use of any policy advice, and its regret scales with the number
of states and actions as $O(|S|\sqrt{|A|})$. To provide a fairer comparison, we
also introduce a natural variant of UCRL2, Upper Confidence with Models (UCWM),
which takes as input a set of MDP models $\mathcal{M}$ which is assumed to
contain the actual model $M$. Like UCRL2, UCWM computes confidence intervals
over the task's model parameters, but then selects the optimistic policy among
the optimal policies for the subset of models in $\mathcal{M}$ consistent with
the confidence interval. This may result in a significantly tighter upper bound
on the optimal value function compared to UCRL2, and may also accelerate the
learning process. If the set of possible models shrinks to one, then UCWM will
seamlessly transition to following the optimal policy for the identified model.
UCWM requires as input a set of MDP models, whereas our RLPA approach requires
only input policies.

We consider a square grid world with 4 actions: up ($a_1$), down ($a_2$), right
($a_3$) and left ($a_4$) for every state. A good action succeeds with probability
0.85, and goes in one of the other directions with probability 0.05 (unless that
would cause it to go into a wall), and a bad action stays in the same place with
probability 0.85 and goes in one of the 4 other directions with probability 0.0375.
We construct four variants of this grid world, $\mathcal{M} = \{M_1, M_2, M_3, M_4\}$. In model
1 ($M_1$) good actions are 1 and 4, in model 2 ($M_2$) good actions are 1 and 2, in
model 3 ($M_3$) good actions are 2 and 3, and in model 4 ($M_4$) good actions are 3 and 4. All
other actions in each MDP are bad actions. The reward in all MDPs is the same
and is $-1$ for all states except for the four corners, which are: 0.7 (upper left), 0.8
(upper right), 0.9 (lower left) and 0.99 (lower right). UCWM receives as input
the MDP models and RLPA receives as input the optimal policies of the models in $\mathcal{M}$.
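
The four grid-world models can be written down as transition tensors along the following lines. This is a sketch, not the code used for the experiments. Two points are assumptions, because the text does not pin them down: a move that would hit a wall leaves the agent in its current cell, and row 0 is taken to be the top of the grid. The names `grid_transitions`, `grid_rewards`, `MOVES` and `GOOD` are illustrative.

```python
import numpy as np

# Row/column offsets for up (a1), down (a2), right (a3), left (a4).
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]

# Good actions per model, converted from the 1-based numbering in the text.
GOOD = {1: {0, 3}, 2: {0, 1}, 3: {1, 2}, 4: {2, 3}}

def step_target(n, r, c, move):
    """Cell reached from (r, c); a move into a wall keeps the agent in place (assumption)."""
    nr, nc = r + move[0], c + move[1]
    return (nr, nc) if 0 <= nr < n and 0 <= nc < n else (r, c)

def grid_transitions(n, good_actions):
    """Transition tensor P[s, a, s'] for an n x n grid with the given good actions."""
    P = np.zeros((n * n, 4, n * n))
    for r in range(n):
        for c in range(n):
            s = r * n + c
            for a in range(4):
                if a in good_actions:
                    # Intended direction w.p. 0.85, each other direction w.p. 0.05.
                    for b in range(4):
                        tr, tc = step_target(n, r, c, MOVES[b])
                        P[s, a, tr * n + tc] += 0.85 if b == a else 0.05
                else:
                    # Stay put w.p. 0.85, each of the 4 directions w.p. 0.0375.
                    P[s, a, s] += 0.85
                    for b in range(4):
                        tr, tc = step_target(n, r, c, MOVES[b])
                        P[s, a, tr * n + tc] += 0.0375
    return P

def grid_rewards(n):
    """State rewards: -1 everywhere except the four corners (row 0 is the top row)."""
    R = np.full(n * n, -1.0)
    R[0] = 0.7                 # upper left
    R[n - 1] = 0.8             # upper right
    R[n * (n - 1)] = 0.9       # lower left
    R[n * n - 1] = 0.99        # lower right
    return R
```

For example, `grid_transitions(8, GOOD[4])` would give the transition tensor of $M_4$ on a grid with 64 states under these assumptions.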

We evaluate the performance of each algorithm in terms of the per-step
regret, $\tilde{\Delta} = \Delta/T$ (see Eq. 3). Each run is $T = 100000$ steps and we average the
performance over 100 runs. The agent is randomly placed at one of the states of
the grid at the beginning of each round. We assume that the true MDP model
is $M_4$. Notice that in this case $\pi^* \in \Pi$, thus $\mu^+ = \mu^*$ and the regret compares
to the optimal average reward. The identity of the true MDP is not known by
the agent. For RLPA we set $f(t) = \log(t)$. We construct grid worlds of various
sizes and compare the resulting performance of the three algorithms.
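
The evaluation metric can be computed from a reward trace as sketched below. The sketch assumes the regret of Eq. 3 has the usual form $\Delta = T\mu^+ - \sum_{t=1}^{T} r_t$, so that the per-step regret is $\mu^+$ minus the running average reward; the function names are illustrative.

```python
import numpy as np

def per_step_regret(rewards, mu_plus):
    """Per-step regret after each step t: mu_plus minus the running average reward."""
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(1, len(rewards) + 1)
    return mu_plus - np.cumsum(rewards) / t

def average_over_runs(runs, mu_plus):
    """Average the per-step regret curves of several equal-length independent runs."""
    return np.mean([per_step_regret(r, mu_plus) for r in runs], axis=0)
```

In the setting above, `runs` would hold 100 reward traces of length 100000 and `mu_plus` the average reward of the best input policy.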

Figure 1 shows the per-step regret of the algorithms as a function of the number
of states. As predicted by the theoretical bounds, the per-step regret $\tilde{\Delta}$ of
UCRL2 significantly increases as the number of states increases, whereas the
average regret of our RLPA is essentially independent of the state space size.^1

^1 The RLPA regret bounds depend on the bias of the optimal policy, which may be
indirectly a function of the structure and size of the domain.
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Fig. 1. Per-step regret versus number of states. 
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Although UCWM has a lower regret than RLPA for a small number of states, 
it quickly loses its advantage as the number of states grows. UCRL2's per-step 
regret plateaus after a small number of states since it is effectively reaching the 
maximum possible regret given the available time horizon. 

To demonstrate the performance of each approach on a single task, Figure 2(a)
shows how the per-step regret changes with the time horizon for
a grid world with 64 states. RLPA achieves a lower regret throughout
the run, with a decrease that is faster than that of both UCRL2 and UCWM. The slight
periodic increases in the regret of RLPA occur when a new trial is started and all
policies are again considered. We also note that the slow rate of decrease for all
three algorithms is due to confidence intervals sized according to the theoretical
results, which are often over-conservative since they are designed to hold
in worst-case scenarios. Finally, Figure 2(b) shows the average running time
of one trial of the algorithms as a function of the number of states. As expected,
RLPA's running time is independent of the size of the state space, whereas the
running time of the other algorithms increases.

Although the domain is simple, these empirical results support our earlier analysis,
demonstrating that RLPA exhibits regret and computational performance that are
essentially independent of the size of the domain state space. This is a significant
advantage over UCRL2, and arises because RLPA can efficiently leverage input
policy advice. A similar improvement is also obtained with respect to UCWM.

7 Related Work 

The setting we consider relates to the multi-armed bandit literature, where an
agent seeks to optimize its reward by uncovering the arm with the best expected
reward. More specifically, our setting relates to restless [9] and rested [15] bandits,
where each arm's distribution is generated by an (unknown) Markov chain that
either transitions at every step, or only when the arm is pulled, respectively.
Unlike either restless or rested bandits, in our case each "arm" is itself an MDP
policy, where different actions may be chosen. However, the most significant
distinction may be that in our setting there is an independent state that couples
the rewards obtained across the policies (the selected action depends on both
the policy/arm selected and the state), in contrast to the rested and restless
bandits, where the Markov chains of each arm evolve independently.

Prior research has demonstrated a significant improvement in learning in
discrete state and action RL tasks whose Markov decision process model parameters
are constrained to lie in a finite set. In this case, the objective of maximizing
the expected sum of rewards can be framed as planning in a finite-state partially
observable Markov decision process [10]: if the parameter set is not too
large, off-the-shelf POMDP planners can be used to yield significant performance
improvements over state-of-the-art RL approaches [2]. Other work [S] on
this setting has proved that the sample complexity of learning to act well scales
independently of the size of the state and action space, and linearly with the size
of the parameter set. These approaches focus on leveraging information about
the model space in the context of Bayesian RL or PAC-style RL, in contrast to
our model-free approach that focuses on regret.

There also exists a wealth of literature on learning with expert advice (e.g. [3]).
The majority of this work lies in supervised learning. Prior work by Diuk et
al. [4] leverages a set of experts, where each expert predicts a probabilistic concept
(such as a state transition), to provide particularly efficient KWIK RL. In
contrast, our approach leverages input policies rather than models. Probabilistic
policy reuse [S] also adaptively selects among a prior set of provided policies, but
may also choose to create and follow a new policy. The authors present promising
empirical results but provide no theoretical guarantees. We
further discuss the question of when to leverage expert advice in
the future work section.

The most closely related work is by Talvitie and Singh [13], who also consider
identifying the best policy from a set of provided input policies. Talvitie and
Singh's approach is a special case of a more general framework for leveraging
experts in sequential decision making environments where the outcomes can
depend on the full history of states and actions [11]: however, this more general
setting provides bounds in terms of an abstract quantity, whereas Talvitie and
Singh provide bounds in terms of the bounds on mixing times of an MDP. There
are several similarities between our algorithm and the work of Talvitie and Singh,
though in contrast to their approach we take an optimism-under-uncertainty
approach, leveraging confidence bounds over the potential average reward of
each policy in the current task. However, the bound provided in their paper is
not a regret bound and no precise expression for the bound is stated, rendering
it infeasible to do a careful comparison of the theoretical bounds. In contrast, we
provide a much more rigorous theoretical analysis, and do so for a more general
setting (for example, our results do not require the MDP to be unichain). Their
algorithm also involves several parameters whose values must be correctly set
for the bounds to hold, but precise expressions for these parameters were not
provided, making it hard to perform an empirical comparison: in the future
we are interested in treating these as tunable parameters and performing
experiments to compare the approaches.

8 Future Work and Conclusion 

In defining RLPA we preferred to provide a simple algorithm which allowed
us to provide a rigorous theoretical analysis. Nonetheless, we expect the current
version of the algorithm can be easily improved along multiple dimensions.
The immediate possibility is to perform off-policy learning across the policies:
whenever reward information is received for a particular state and action, this
could be used to update the average reward estimate $\hat{\mu}(\pi)$ for all policies that
would have suggested the same action for the given state. As has been shown
in other scenarios, we expect this could improve the empirical performance of
RLPA. However, the implications for the theoretical results are less clear. Indeed,
updating the estimate $\hat{\mu}(\pi)$ of a policy $\pi$ whenever a "compatible" reward
is observed would correspond to a significant increase in the number of episodes
$K(\pi)$ (see Eq. 2). As a result, the convergence rate of $\hat{\mu}(\pi)$ might get worse and
could potentially degrade up to the point where $\hat{\mu}(\pi)$ does not even converge to
the actual average reward $\mu^\pi$ (see Lemma 1 when $K(\pi) \simeq v(\pi)$). We intend to
further investigate this in the future.
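
The shared update discussed above might look as follows. This sketch only illustrates the idea and is not part of RLPA as analyzed in this paper: it assumes deterministic input policies stored as a state-to-action table, keeps a plain running mean per policy, and ignores the episode structure that the analysis relies on. The class and method names are hypothetical.

```python
import numpy as np

class SharedRewardEstimates:
    """Running average-reward estimates shared across compatible policies.

    policies: integer array of shape (m, S); policies[i, s] is the action
    that policy i takes in state s (deterministic policies are assumed).
    """

    def __init__(self, policies):
        self.policies = np.asarray(policies)
        m = self.policies.shape[0]
        self.sums = np.zeros(m)
        self.counts = np.zeros(m, dtype=int)

    def update(self, state, action, reward):
        # Every policy that would have chosen `action` in `state` gets the sample.
        compatible = self.policies[:, state] == action
        self.sums[compatible] += reward
        self.counts[compatible] += 1
        return compatible

    def estimates(self):
        # Policies with no samples yet are reported as NaN.
        out = np.full(len(self.sums), np.nan)
        seen = self.counts > 0
        out[seen] = self.sums[seen] / self.counts[seen]
        return out
```

Because the samples credited to a policy here are not drawn from that policy's own trajectory, the resulting estimates can be biased, which is the concern about convergence raised in the text.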

Another very interesting direction of future work is to extend RLPA to leverage
policy advice when useful, but still maintain generic RL guarantees if the
input policy space is a poor fit to the current problem. More concretely, currently
if $\pi^+$ is not the actual optimal policy of the MDP, RLPA suffers an additional
linear regret to the optimal policy of order $T(\mu^* - \mu^+)$. If $T$ is very large and $\pi^+$
is highly suboptimal, the total regret of RLPA may be worse than that of UCRL2, which
always eventually learns the optimal policy. This opens the question of whether it
is possible to design an algorithm able to take advantage of the small regret-to-best
of RLPA when $T$ is small and $\pi^+$ is nearly optimal, and of the guarantees of
UCRL2 for the regret-to-optimal.
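
As a rough back-of-the-envelope illustration of this trade-off, suppose the regret-to-best of RLPA is bounded by $c_1\sqrt{T}$ and the regret-to-optimal of UCRL2 by $c_2\sqrt{T}$, where $c_1 < c_2$ are hypothetical constants that absorb logarithmic and problem-dependent factors. Comparing these upper bounds (not the actual regrets) gives the range of horizons over which the RLPA bound is the smaller one:

```latex
\begin{align*}
  \text{RLPA regret-to-optimal} &\le c_1\sqrt{T} + T(\mu^* - \mu^+), \\
  c_1\sqrt{T} + T(\mu^* - \mu^+) \le c_2\sqrt{T}
  &\iff \sqrt{T}\,(\mu^* - \mu^+) \le c_2 - c_1 \\
  &\iff T \le \left(\frac{c_2 - c_1}{\mu^* - \mu^+}\right)^{2}
  \qquad (\mu^* > \mu^+).
\end{align*}
```

Under these assumptions the crossover horizon grows quadratically as the gap $\mu^* - \mu^+$ shrinks, and when $\mu^+ = \mu^*$ the RLPA bound is smaller for every $T$.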

To conclude, we have presented RLPA, a new RL algorithm that leverages 
an input set of policies. We prove the regret of RLPA relative to the best pol- 
icy scales sublinearly with the time horizon, and that both this regret and the 
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computational complexity of RLPA are independent of the size of the state and 
action space. This suggests that RLPA may offer significant advantages in large 
domains where some prior policies (perhaps through past experience with related 
tasks, through various approximations, or domain experts) are available. 
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